MCMC

Jesse Brunner

Motivation for MCMC

Many models are too complex for quadratic approximation or other shortcuts.

  • Many dimensions
  • Interactions
  • Multi-level models (we’ll get there)

We need a general method of numeric integration to explore the posterior effectively.

MCMC \(\xrightarrow{\;t\to\infty\;}\) posterior: as the chain runs, its samples converge to samples from the posterior.

What’s in a name? Markov chain Monte Carlo

Monte Carlo: “repeated random sampling to obtain numerical results” -Wikipedia

Markov chain: stochastic model in which state at time \(t\) depends only on previous state at \(t-1\)

Imagine a posterior

Imagine this one-dimensional posterior:

  • The x-axis is our parameter, \(\theta\).
  • The y-axis is the posterior probability.
    • actually, is proportional to the posterior
    • we don’t know the probability of the data, \(\text{Pr}(x)\).

Imagine a posterior

Now imagine that we cannot see it!

We can only calculate the heights given particular guesses of \(\theta\)

Step 1

Imagine we take a guess for \(\theta\). This is the first step of our (Markov) chain.

Step 1

We can then find the height for this guess

height = likelihood of data given \(\theta\) \(\times\) prior of \(\theta\)
= \(\text{Pr}(x|\theta) \times \text{Pr}(\theta)\).

Step 2

Choose a proposal for a new value on the chain.

  • Our proposal distribution (blue) is centered on our last guess (gray).
  • It can be any symmetric shape.

Step 2

Find the height (posterior-ish value) for our proposal

Step 3

Calculate the ratio of those heights: \[ r = \frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x|\theta_{t})\text{Pr}(\theta_t)} = \frac{0.078}{0.07} = 1.114 \]

Choose a random uniform variable, \(U \sim \text{Uniform}(0,1)\)

If \(r > U\), accept the proposal & add it to the chain. If \(r < U\), reject it, add current value to chain, & try a new proposal.

Why ratios are clever

We want to calculate the ratio of posterior probabilities:

\[ r = \frac{\text{Pr}(\theta_p | x)}{\text{Pr}(\theta_{t} | x)} = \frac{\frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x)}}{\frac{\text{Pr}(x|\theta_{t})\text{Pr}(\theta_{t})}{\text{Pr}(x)}} \] but we do not know (and cannot calculate) \(\text{Pr}(x)\).

However, the \(\text{Pr}(x)\) terms cancel out! We just need the numerators! \[ r = \frac{\text{Pr}(\theta_p | x)}{\text{Pr}(\theta_{t} | x)} = \frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x|\theta_t)\text{Pr}(\theta_t)} \]
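Numerically, this cancellation means we can work entirely with unnormalized heights. A minimal sketch, assuming a hypothetical model (normal likelihood, normal prior) with made-up values of \(x\), \(\theta_t\), and \(\theta_p\) — none of these numbers come from the slides:

```python
import math

# Hypothetical model: x ~ Normal(theta, 1), prior theta ~ Normal(0, 2).
def unnorm_posterior(theta, x):
    """Likelihood * prior: the numerator of Bayes' rule, without Pr(x)."""
    lik = math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)
    prior = math.exp(-0.5 * (theta / 2) ** 2) / (2 * math.sqrt(2 * math.pi))
    return lik * prior

x = 1.0
theta_t, theta_p = 0.5, 0.8  # current value and proposal (arbitrary)

# Pr(x) would appear in both numerator and denominator, so the
# acceptance ratio needs only the unnormalized heights:
r = unnorm_posterior(theta_p, x) / unnorm_posterior(theta_t, x)
```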

Repeat steps 2 & 3

\[ r = \frac{0.061}{0.078} = 0.782; \quad U = 0.89 \]

Since \(r < U\), reject proposal, add current value to chain, & try a new proposal.

One more time

\[ r = \frac{0.079}{0.078} = 1.013; \quad U = 0.212 \]

Since \(r > U\), accept proposal & add to chain

Chain so far

Chain after a while

Turn it on its side

It works!

This is the Metropolis Algorithm
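The whole loop above can be sketched in a few lines of Python. The target here is a hypothetical unnormalized posterior with a Normal(2, 1) shape, purely for illustration; a real problem would plug in its own likelihood \(\times\) prior. Working on the log scale avoids numerical underflow:

```python
import math
import random

random.seed(1)

def log_unnorm_post(theta):
    """Log of likelihood * prior; a hypothetical Normal(2, 1)-shaped target."""
    return -0.5 * (theta - 2.0) ** 2

def metropolis(n_steps, theta0=0.0, step_sd=1.0):
    chain = [theta0]                                 # step 1: a first guess
    for _ in range(n_steps):
        current = chain[-1]
        proposal = random.gauss(current, step_sd)    # step 2: symmetric proposal
        log_r = log_unnorm_post(proposal) - log_unnorm_post(current)  # step 3
        if math.log(random.random()) < log_r:        # accept if r > U
            chain.append(proposal)
        else:
            chain.append(current)                    # reject: repeat current value
    return chain

chain = metropolis(20000)
mean = sum(chain[1000:]) / len(chain[1000:])  # discard warm-up; should be near 2
```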

Did you catch the problem?

What if we get proposals outside the parameter’s possible boundaries (e.g., a negative value for a variance)?

  • lots of rejections
  • inefficient sampling

Solution: asymmetric proposal distribution

However, this proposal distribution is now asymmetric:

\[ \text{Pr}(\theta_{t-1} \rightarrow \theta_p) \neq \text{Pr}(\theta_{p} \rightarrow \theta_{t-1}) \]

Metropolis-Hastings algorithm

Adjust the ratio to: \[ r = \frac{\text{Pr}(x|\theta_p)\text{Pr}(\theta_p)}{\text{Pr}(x|\theta_{t-1})\text{Pr}(\theta_{t-1})}\times \frac{\text{Pr}(\theta_{t-1}|\theta_p)}{\text{Pr}(\theta_p|\theta_{t-1})} \]

and it works!
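A sketch of the Hastings correction, assuming a hypothetical positive-only parameter with an Exponential(1) unnormalized posterior and a log-normal proposal (which can never propose negative values). For this particular proposal the correction factor works out to \(\theta_p / \theta_{t-1}\):

```python
import math
import random

random.seed(2)

def log_unnorm_post(theta):
    """Hypothetical positive-only target: unnormalized Exponential(1)."""
    return -theta

def mh_lognormal(n_steps, theta0=1.0, s=0.5):
    chain = [theta0]
    for _ in range(n_steps):
        cur = chain[-1]
        prop = math.exp(random.gauss(math.log(cur), s))  # always > 0
        # Hastings correction: log q(cur|prop) - log q(prop|cur),
        # which for a log-normal proposal reduces to log(prop/cur).
        log_r = (log_unnorm_post(prop) - log_unnorm_post(cur)
                 + math.log(prop) - math.log(cur))
        if math.log(random.random()) < log_r:
            chain.append(prop)
        else:
            chain.append(cur)
    return chain

chain = mh_lognormal(30000)
mean = sum(chain[2000:]) / len(chain[2000:])  # Exponential(1) has mean 1
```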

Gibbs sampling

A widespread alternative is Gibbs sampling (WinBugs, JAGS)

Like moving along the posterior one dimension at a time

  1. Start with an initial guess
  2. For \(\theta_1\), sample it conditional on the current values of the other variables (\(\theta_2, \theta_3,...,\theta_n\))
    • the proposal always has a ratio of 1, so it is always accepted
  3. Update that variable with this new sample
  4. Repeat for the remaining variables (one iteration)
  5. Keep repeating steps 2–4

Suffers when there is correlation among variables
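A minimal Gibbs sampler for an assumed standard bivariate normal target with correlation \(\rho = 0.6\) (a hypothetical example; here the full conditionals are known normals):

```python
import math
import random

random.seed(3)
rho = 0.6  # assumed correlation of a standard bivariate normal target

def gibbs(n_iters, t1=0.0, t2=0.0):
    samples = []
    cond_sd = math.sqrt(1 - rho ** 2)  # sd of each full conditional
    for _ in range(n_iters):
        # Sample each variable conditional on the current value of the other;
        # the draw comes straight from the conditional, so it is always accepted.
        t1 = random.gauss(rho * t2, cond_sd)
        t2 = random.gauss(rho * t1, cond_sd)
        samples.append((t1, t2))
    return samples

s = gibbs(20000)[1000:]               # discard warm-up
m1 = sum(a for a, _ in s) / len(s)    # should be near 0
corr = sum(a * b for a, b in s) / len(s)  # should be near rho
```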

But… higher dimensions cause problems

A proposed jump from one of these points could easily land far from the region of high posterior density.

  • Gets worse with more dimensions (there are more ways to go in the wrong direction)
  • Many, even most, proposals are rejected \(\longrightarrow\) inefficient

Gibbs sampling gets slowed down, too.
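The dimensionality problem is easy to demonstrate: run random-walk Metropolis on a standard normal target with a fixed per-coordinate step size and watch the acceptance rate collapse as dimensions are added (a hypothetical illustration, not from the slides):

```python
import math
import random

random.seed(4)

def acceptance_rate(dim, n_steps=5000, step_sd=1.0):
    """Metropolis on a dim-dimensional standard normal; fraction accepted."""
    theta = [0.0] * dim
    accepted = 0
    for _ in range(n_steps):
        prop = [t + random.gauss(0, step_sd) for t in theta]
        # log ratio of unnormalized standard-normal densities
        log_r = 0.5 * (sum(t * t for t in theta) - sum(p * p for p in prop))
        if math.log(random.random()) < log_r:
            theta = prop
            accepted += 1
    return accepted / n_steps

low_dim = acceptance_rate(1)    # healthy acceptance rate
high_dim = acceptance_rate(25)  # almost everything rejected
```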

Solution: HMC

Hamiltonian Monte Carlo

  • pretends the (negative-log) posterior is a surface
  • simulates a ball rolling along that surface (without friction)
  • produces proposals informed by the topography
  • thus proposals (that meet certain energy criteria) are almost always accepted

HMC

First, take the negative log, so high probability = low areas

HMC

But remember, we cannot see it

We can only calculate the height given a value of \(\theta\)

HMC

Take a first guess

HMC

…and give it a bit of momentum (in a direction of parameter space)

Distribution of momentum comes from standard (multivariate) normal distribution

HMC

Track its movement for a certain amount of “time” using Hamilton’s equations

\[ \begin{align} \text{Total Energy} &= \text{Potential} + \text{Kinetic}\\ E &= U(\theta) + K(\text{momentum}) \end{align} \]

After some pre-determined amount of time, the position of the point is the new proposal

HMC

Proposals tend to be in areas with higher probability density

HMC details

No analytic solution to Hamiltonian for most problems, so:

  • solve it numerically over a series of steps (Leapfrog algorithm)
  • Stan will tune step size and number of leapfrog steps
    • too many steps and we end up at the bottom, every time
    • too coarse and we don’t follow the posterior
  • Total energy (\(= \text{Potential} + \text{Kinetic}\)) at start should equal the energy at the end if we followed posterior surface (No friction!)
    • gives a warning if these diverge

Proposal acceptance actually reflects the difference in energy (i.e., how well we followed the path of the ball)

  • unless there were divergences, we should accept
  • divergences suggest a problem we need to fix
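The leapfrog steps and the energy-based accept/reject can be sketched for a one-dimensional standard normal target (a toy stand-in; Stan handles all of this, including tuning, automatically):

```python
import math
import random

random.seed(5)

def U(theta):
    """Potential energy = negative log posterior; standard normal target assumed."""
    return 0.5 * theta ** 2

def grad_U(theta):
    return theta

def hmc_step(theta, eps=0.2, n_leapfrog=20):
    m = random.gauss(0, 1)           # draw momentum from a standard normal
    th, p = theta, m
    p -= 0.5 * eps * grad_U(th)      # half step for momentum
    for i in range(n_leapfrog):
        th += eps * p                # full step for position
        if i < n_leapfrog - 1:
            p -= eps * grad_U(th)    # full step for momentum
    p -= 0.5 * eps * grad_U(th)      # final half step for momentum
    # Accept based on the change in total energy E = U + K;
    # if the integrator followed the surface well, dE is near zero.
    dE = (U(th) + 0.5 * p ** 2) - (U(theta) + 0.5 * m ** 2)
    if math.log(random.random()) < -dE:
        return th
    return theta

chain = [0.0]
for _ in range(5000):
    chain.append(hmc_step(chain[-1]))
draws = chain[500:]
mean = sum(draws) / len(draws)                           # should be near 0
var = sum((d - mean) ** 2 for d in draws) / len(draws)   # should be near 1
```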

HMC details

Overall, each proposal requires many more calculations (many leapfrog steps, updating position and momentum at each), but the proposals are much better and rarely rejected, so HMC is much more efficient overall.

  • Most of the time spent with Stan is compiling the model to C++
  • Running is fast!
  • Get warnings of fitting problems practically for free!